Skip to article frontmatterSkip to article content

Policy Gradient Methods

The previous chapter showed how to handle continuous action spaces in fitted Q-iteration by amortizing action selection with policy networks. Methods like NFQCA, DDPG, TD3, and SAC all learn both a Q-function and a policy, using the Q-function to guide policy improvement. This chapter explores a different approach: optimizing policies directly without maintaining explicit value functions.

Direct policy optimization offers several advantages. First, it naturally handles stochastic policies, which can be essential for partially observable environments or problems requiring explicit exploration. Second, it avoids the detour through value function approximation, which may introduce errors that compound during policy extraction. Third, for problems with simple policy classes but complex value landscapes, directly searching in policy space can be more efficient than searching in value space.

The foundation of policy gradient methods rests on computing gradients of expected returns with respect to policy parameters. This chapter develops the mathematical machinery needed for this computation, starting with general derivative estimation techniques from stochastic optimization, then specializing to reinforcement learning settings, and finally examining variance reduction methods that make these estimators practical.

Derivative Estimation for Stochastic Optimization

Consider optimizing an objective that involves an expectation:

J(θ)=Exp(x;θ)[f(x,θ)]J(\theta) = \mathbb{E}_{x \sim p(x;\theta)}[f(x,\theta)]

For concreteness, consider a simple example where xN(θ,1)x \sim \mathcal{N}(\theta,1) and f(x,θ)=x2θf(x,\theta) = x^2\theta. The derivative we seek is:

ddθJ(θ)=ddθx2θp(x;θ)dx\frac{d}{d\theta}J(\theta) = \frac{d}{d\theta}\int x^2\theta p(x;\theta)dx

While we can compute this exactly for the Gaussian example, this is often impossible for more general problems. We might then be tempted to approximate our objective using samples:

J(θ)1Ni=1Nf(xi,θ),xip(x;θ)J(\theta) \approx \frac{1}{N}\sum_{i=1}^N f(x_i,\theta), \quad x_i \sim p(x;\theta)

Then differentiate this approximation:

ddθJ(θ)1Ni=1Nθf(xi,θ)\frac{d}{d\theta}J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \frac{\partial}{\partial \theta}f(x_i,\theta)

However, this naive approach ignores that the samples themselves depend on θ\theta. The correct derivative requires the product rule:

ddθJ(θ)=θ[f(x,θ)p(x;θ)]dx=[fθp(x;θ)+f(x,θ)p(x;θ)θ]dx\frac{d}{d\theta}J(\theta) = \int \frac{\partial}{\partial \theta}[f(x,\theta)p(x;\theta)]dx = \int \left[\frac{\partial f}{\partial \theta}p(x;\theta) + f(x,\theta)\frac{\partial p(x;\theta)}{\partial \theta}\right]dx

While the first term could be numerically integrated using Monte Carlo, the second one cannot as it is not in the form of an expectation.

To transform our objective so that the Monte Carlo estimator for the objective could be differentiated directly while ensuring that the resulting derivative is unbiased, there are two main solutions: a change of measure, or a change of variables.

The Likelihood Ratio Method

One solution comes from rewriting our objective using any distribution q(x)q(x):

J(θ)=f(x,θ)p(x;θ)q(x)q(x)dx=Exq(x)[f(x,θ)p(x;θ)q(x)]J(\theta) = \int f(x,\theta)\frac{p(x;\theta)}{q(x)}q(x)dx = \mathbb{E}_{x \sim q(x)}\left[f(x,\theta)\frac{p(x;\theta)}{q(x)}\right]

Write this more functionally by defining:

J(θ)=Exq(x)[h(x,θ)],h(x,θ)f(x,θ)p(x;θ)q(x)J(\theta) = \mathbb{E}_{x \sim q(x)}[h(x,\theta)] , \enspace h(x,\theta) \equiv f(x,\theta)\frac{p(x;\theta)}{q(x)}

When we differentiate JJ, we must take the partial derivative of hh with respect to its second argument:

ddθJ(θ)=Exq(x)[hθ(x,θ)]=Exq(x)[f(x,θ)θp(x;θ)q(x)+p(x;θ)q(x)fθ(x,θ)]\frac{d}{d\theta}J(\theta) = \mathbb{E}_{x \sim q(x)}\left[\frac{\partial h}{\partial \theta}(x,\theta)\right] = \mathbb{E}_{x \sim q(x)}\left[f(x,\theta)\frac{\partial}{\partial \theta}\frac{p(x;\theta)}{q(x)} + \frac{p(x;\theta)}{q(x)}\frac{\partial f}{\partial \theta}(x,\theta)\right]

The so-called “score function” derivative estimator is obtained for the choice of q(x)=p(x;θ)q(x) = p(x;\theta), where the ratio simplifies to 1 and its derivative becomes the score function:

ddθJ(θ)=Exp(x;θ)[f(x,θ)logp(x,θ)θ+f(x,θ)θ]\frac{d}{d\theta}J(\theta) = \mathbb{E}_{x \sim p(x;\theta)}\left[f(x,\theta)\frac{\partial \log p(x,\theta)}{\partial \theta} + \frac{\partial f(x,\theta)}{\partial \theta}\right]

The Reparameterization Trick

An alternative approach eliminates the θ\theta-dependence in the sampling distribution by expressing xx through a deterministic transformation of the noise:

x=g(ϵ,θ),ϵq(ϵ)x = g(\epsilon,\theta), \quad \epsilon \sim q(\epsilon)

Therefore if we want to sample from some target distribution p(x;θ)p(x;\theta), we can do so by first sampling from a simple base distribution q(ϵ)q(\epsilon) (like a standard normal) and then transforming those samples through a carefully chosen function gg. If g(,θ)g(\cdot,\theta) is invertible, the change of variables formula tells us how these distributions relate:

p(x;θ)=q(g1(x,θ))detg1(x,θ)x=q(ϵ)detg(ϵ,θ)ϵ1p(x;\theta) = q(g^{-1}(x,\theta))\left|\det\frac{\partial g^{-1}(x,\theta)}{\partial x}\right| = q(\epsilon)\left|\det\frac{\partial g(\epsilon,\theta)}{\partial \epsilon}\right|^{-1}

For example, if we want to sample from any multivariate Gaussian distributions with covariance matrix Σ\Sigma and mean μ\mu, it suffices to be able to sample from a standard normal noise and compute the linear transformation:

x=μ+Σ1/2ϵ,ϵN(0,I)x = \mu + \Sigma^{1/2}\epsilon, \quad \epsilon \sim \mathcal{N}(0,I)

where Σ1/2\Sigma^{1/2} is the matrix square root obtained via Cholesky decomposition. In the univariate case, this transformation is simply:

x=μ+σϵ,ϵN(0,1)x = \mu + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0,1)

where σ=σ2\sigma = \sqrt{\sigma^2} is the standard deviation (square root of the variance).

Common Examples of Reparameterization

The Truncated Normal Distribution

When we need samples constrained to an interval [a,b][a,b], we can use the truncated normal distribution. To sample from it, we transform uniform noise through the inverse cumulative distribution function (CDF) of the standard normal:

x=Φ1(uΦ(b)+(1u)Φ(a)),uUniform(0,1)x = \Phi^{-1}(u\Phi(b) + (1-u)\Phi(a)), \quad u \sim \text{Uniform}(0,1)

Here:

The resulting samples follow a normal distribution restricted to [a,b][a,b], with the density properly normalized over this interval.

The Kumaraswamy Distribution

When we need samples in the unit interval [0,1], a natural choice might be the Beta distribution. However, its inverse CDF doesn’t have a closed form. Instead, we can use the Kumaraswamy distribution as a convenient approximation, which allows for a simple reparameterization:

x=(1(1uα)1/β),uUniform(0,1)x = (1-(1-u^{\alpha})^{1/\beta}), \quad u \sim \text{Uniform}(0,1)

where:

The Kumaraswamy distribution has density:

f(x;α,β)=αβxα1(1xα)β1,x[0,1]f(x; \alpha, \beta) = \alpha\beta x^{\alpha-1}(1-x^{\alpha})^{\beta-1}, \quad x \in [0,1]
The Gumbel-Softmax Distribution

When sampling from a categorical distribution with probabilities {πi}\{\pi_i\}, one approach uses Gumbel(0,1)\text{Gumbel}(0,1) noise combined with the argmax of log-perturbed probabilities:

argmaxi(logπi+gi),giGumbel(0,1)\text{argmax}_i(\log \pi_i + g_i), \quad g_i \sim \text{Gumbel}(0,1)

This approach, known in machine learning as the Gumbel-Max trick, relies on sampling Gumbel noise from uniform random variables through the transformation gi=log(log(ui))g_i = -\log(-\log(u_i)) where uiUniform(0,1)u_i \sim \text{Uniform}(0,1). To see why this gives us samples from the categorical distribution, consider the probability of selecting category ii:

P(argmaxj(logπj+gj)=i)=P(logπi+gi>logπj+gj for all ji)=P(gigj>logπjlogπi for all ji)\begin{align*} P(\text{argmax}_j(\log \pi_j + g_j) = i) &= P(\log \pi_i + g_i > \log \pi_j + g_j \text{ for all } j \neq i) \\ &= P(g_i - g_j > \log \pi_j - \log \pi_i \text{ for all } j \neq i) \end{align*}

Since the difference of two Gumbel random variables follows a logistic distribution, gigjLogistic(0,1)g_i - g_j \sim \text{Logistic}(0,1), and these differences are independent for different jj (due to the independence of the original Gumbel variables), we can write:

P(argmaxj(logπj+gj)=i)=jiP(gigj>logπjlogπi)=jiπiπi+πj=πi\begin{align*} P(\text{argmax}_j(\log \pi_j + g_j) = i) &= \prod_{j \neq i} P(g_i - g_j > \log \pi_j - \log \pi_i) \\ &= \prod_{j \neq i} \frac{\pi_i}{\pi_i + \pi_j} = \pi_i \end{align*}

The last equality requires some additional algebra to show, but follows from the fact that these probabilities must sum to 1 over all ii.

While we have shown that the Gumbel-Max trick gives us exact samples from a categorical distribution, the argmax operation isn’t differentiable. For stochastic optimization problems of the form:

Exp(x;θ)[f(x)]=EϵGumbel(0,1)[f(g(ϵ,θ))]\mathbb{E}_{x \sim p(x;\theta)}[f(x)] = \mathbb{E}_{\epsilon \sim \text{Gumbel}(0,1)}[f(g(\epsilon,\theta))]

we need gg to be differentiable with respect to θ\theta. This leads us to consider a continuous relaxation where we replace the hard argmax with a temperature-controlled softmax:

zi=exp((logπi+gi)/τ)jexp((logπj+gj)/τ)z_i = \frac{\exp((\log \pi_i + g_i)/\tau)}{\sum_j \exp((\log \pi_j + g_j)/\tau)}

As τ0\tau \to 0, this approximation approaches the argmax:

limτ0exp(xi/τ)jexp(xj/τ)={1if xi=maxjxj0otherwise\lim_{\tau \to 0} \frac{\exp(x_i/\tau)}{\sum_j \exp(x_j/\tau)} = \begin{cases} 1 & \text{if } x_i = \max_j x_j \\ 0 & \text{otherwise} \end{cases}

The resulting distribution over the probability simplex is called the Gumbel-Softmax (or Concrete) distribution. The temperature parameter τ\tau controls the discreteness of our samples: smaller values give samples closer to one-hot vectors but with less stable gradients, while larger values give smoother gradients but more diffuse samples.

Numerical Analysis of Gradient Estimators

Let us examine the behavior of our three gradient estimators for the stochastic optimization objective:

J(θ)=ExN(θ,1)[x2θ]J(\theta) = \mathbb{E}_{x \sim \mathcal{N}(\theta,1)}[x^2\theta]

To get an analytical expression for the derivative, first note that we can factor out θ\theta to obtain J(θ)=θE[x2]J(\theta) = \theta\mathbb{E}[x^2] where xN(θ,1)x \sim \mathcal{N}(\theta,1). By definition of the variance, we know that Var(x)=E[x2](E[x])2\text{Var}(x) = \mathbb{E}[x^2] - (\mathbb{E}[x])^2, which we can rearrange to E[x2]=Var(x)+(E[x])2\mathbb{E}[x^2] = \text{Var}(x) + (\mathbb{E}[x])^2. Since xN(θ,1)x \sim \mathcal{N}(\theta,1), we have Var(x)=1\text{Var}(x) = 1 and E[x]=θ\mathbb{E}[x] = \theta, therefore E[x2]=1+θ2\mathbb{E}[x^2] = 1 + \theta^2. This gives us:

J(θ)=θ(1+θ2)J(\theta) = \theta(1 + \theta^2)

Now differentiating with respect to θ\theta using the product rule yields:

ddθJ(θ)=1+3θ2\frac{d}{d\theta}J(\theta) = 1 + 3\theta^2

For concreteness, we fix θ=1.0\theta = 1.0 and analyze samples drawn using Monte Carlo estimation with batch size 1000 and 1000 independent trials. Evaluating at θ=1\theta = 1 gives us ddθJ(θ)θ=1=1+3(1)2=4\frac{d}{d\theta}J(\theta)\big|_{\theta=1} = 1 + 3(1)^2 = 4, which serves as our ground truth against which we compare our estimators:

  1. First, we consider the naive estimator that incorrectly differentiates the Monte Carlo approximation:

    g^naive(θ)=1Ni=1Nxi2\hat{g}_{\text{naive}}(\theta) = \frac{1}{N}\sum_{i=1}^N x_i^2

    For xN(1,1)x \sim \mathcal{N}(1,1), we have E[x2]=θ2+1=2.0\mathbb{E}[x^2] = \theta^2 + 1 = 2.0 and E[g^naive]=2.0\mathbb{E}[\hat{g}_{\text{naive}}] = 2.0. We should therefore expect a bias of about -2 in our experiment.

  2. Then we compute the score function estimator:

    g^SF(θ)=1Ni=1N[xi2θ(xiθ)+xi2]\hat{g}_{\text{SF}}(\theta) = \frac{1}{N}\sum_{i=1}^N \left[x_i^2\theta(x_i - \theta) + x_i^2\right]

    This estimator is unbiased with E[g^SF]=4\mathbb{E}[\hat{g}_{\text{SF}}] = 4

  3. Finally, through the reparameterization x=θ+ϵx = \theta + \epsilon where ϵN(0,1)\epsilon \sim \mathcal{N}(0,1), we obtain:

    g^RT(θ)=1Ni=1N[2θ(θ+ϵi)+(θ+ϵi)2]\hat{g}_{\text{RT}}(\theta) = \frac{1}{N}\sum_{i=1}^N \left[2\theta(\theta + \epsilon_i) + (\theta + \epsilon_i)^2\right]

    This estimator is also unbiased with E[g^RT]=4\mathbb{E}[\hat{g}_{\text{RT}}] = 4.

Source
%config InlineBackend.figure_format = 'retina'
import jax
import jax.numpy as jnp
import matplotlib.pyplot as plt

# Apply book style
try:
    import scienceplots
    plt.style.use(['science', 'notebook'])
except (ImportError, OSError):
    pass  # Use matplotlib defaults

key = jax.random.PRNGKey(0)

# Define the objective function f(x,θ) = x²θ where x ~ N(θ, 1)
def objective(x, theta):
    return x**2 * theta

# Naive Monte Carlo gradient estimation
@jax.jit
def naive_gradient_batch(key, theta):
    samples = jax.random.normal(key, (1000,)) + theta
    # Use jax.grad on the objective with respect to theta
    grad_fn = jax.grad(lambda t: jnp.mean(objective(samples, t)))
    return grad_fn(theta)

# Score function estimator (REINFORCE)
@jax.jit
def score_function_batch(key, theta):
    samples = jax.random.normal(key, (1000,)) + theta
    # f(x,θ) * ∂logp(x|θ)/∂θ + ∂f(x,θ)/∂θ
    # score function for N(θ,1) is (x-θ)
    score = samples - theta
    return jnp.mean(objective(samples, theta) * score + samples**2)

# Reparameterization gradient
@jax.jit
def reparam_gradient_batch(key, theta):
    eps = jax.random.normal(key, (1000,))
    # Use reparameterization x = θ + ε, ε ~ N(0,1)
    grad_fn = jax.grad(lambda t: jnp.mean(objective(t + eps, t)))
    return grad_fn(theta)

# Run trials
n_trials = 1000
theta = 1.0
true_grad = 1 + 3 * theta**2

keys = jax.random.split(key, n_trials)
naive_estimates = jnp.array([naive_gradient_batch(k, theta) for k in keys])
score_estimates = jnp.array([score_function_batch(k, theta) for k in keys])
reparam_estimates = jnp.array([reparam_gradient_batch(k, theta) for k in keys])

# Create violin plots with individual points
plt.figure(figsize=(12, 6))
data = [naive_estimates, score_estimates, reparam_estimates]
colors = ['#ff9999', '#66b3ff', '#99ff99']

parts = plt.violinplot(data, showextrema=False)
for i, pc in enumerate(parts['bodies']):
    pc.set_facecolor(colors[i])
    pc.set_alpha(0.7)

# Add box plots
plt.boxplot(data, notch=True, showfliers=False)

# Add true gradient line
plt.axhline(y=true_grad, color='r', linestyle='--', label='True Gradient')

plt.xticks([1, 2, 3], ['Naive', 'Score Function', 'Reparam'])
plt.ylabel('Gradient Estimate')
plt.title(f'Gradient Estimators (θ={theta}, true grad={true_grad:.2f})')
plt.grid(True, alpha=0.3)
plt.legend()

# Print statistics
methods = {
    'Naive': naive_estimates,
    'Score Function': score_estimates, 
    'Reparameterization': reparam_estimates
}

for name, estimates in methods.items():
    bias = jnp.mean(estimates) - true_grad
    variance = jnp.var(estimates)
    print(f"\n{name}:")
    print(f"Mean: {jnp.mean(estimates):.6f}")
    print(f"Bias: {bias:.6f}")
    print(f"Variance: {variance:.6f}")
    print(f"MSE: {bias**2 + variance:.6f}")

Naive:
Mean: 2.000417
Bias: -1.999583
Variance: 0.005933
MSE: 4.004266

Score Function:
Mean: 3.996162
Bias: -0.003838
Variance: 0.057295
MSE: 0.057309

Reparameterization:
Mean: 3.999940
Bias: -0.000060
Variance: 0.017459
MSE: 0.017459
<Figure size 1200x600 with 1 Axes>

The numerical experiments corroborate our theory. The naive estimator consistently underestimates the true gradient by 2.0, though it maintains a relatively small variance. This systematic bias would make it unsuitable for optimization despite its low variance. The score function estimator corrects this bias but introduces substantial variance. While unbiased, this estimator would require many samples to achieve reliable gradient estimates. Finally, the reparameterization trick achieves a much lower variance while remaining unbiased. While this experiment is for didactic purposes only, it reproduces what is commonly found in practice: that when applicable, the reparameterization estimator tends to perform better than the score function counterpart.

Score Function Methods in Reinforcement Learning

The score function estimator from the previous section applies directly to reinforcement learning. Since it requires only the ability to evaluate and differentiate logπw(as)\log \pi_{\boldsymbol{w}}(a|s), it works with any differentiable policy, including discrete action spaces where reparameterization is unavailable. It requires no model of the environment dynamics.

Let G(τ)t=0Tr(st,at)G(\tau) \equiv \sum_{t=0}^T r(s_t, a_t) be the sum of undiscounted rewards in a trajectory τ\tau. The stochastic optimization problem we face is to maximize:

J(w)=Eτp(τ;w)[G(τ)]J(\boldsymbol{w}) = \mathbb{E}_{\tau \sim p(\tau;\boldsymbol{w})}[G(\tau)]

where τ=(s0,a0,s1,a1,...)\tau = (s_0,a_0,s_1,a_1,...) is a trajectory and G(τ)G(\tau) is the total return. Applying the score function estimator, we get:

wJ(w)=wEτ[G(τ)]=Eτ[G(τ)wlogp(τ;w)]=Eτ[G(τ)wt=0Tlogπw(atst)]=Eτ[G(τ)t=0Twlogπw(atst)]\begin{align*} \nabla_{\boldsymbol{w}}J(\boldsymbol{w}) &= \nabla_{\boldsymbol{w}}\mathbb{E}_{\tau}[G(\tau)] \\ &= \mathbb{E}_{\tau}\left[G(\tau)\nabla_{\boldsymbol{w}}\log p(\tau;\boldsymbol{w})\right] \\ &= \mathbb{E}_{\tau}\left[G(\tau)\nabla_{\boldsymbol{w}}\sum_{t=0}^T\log \pi_{\boldsymbol{w}}(a_t|s_t)\right] \\ &= \mathbb{E}_{\tau}\left[G(\tau)\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\right] \end{align*}

We have eliminated the need to know the transition probabilities in this estimator since the probability of a trajectory factorizes as:

p(τ;w)=p(s0)t=0Tπw(atst)p(st+1st,at)p(\tau;\boldsymbol{w}) = p(s_0)\prod_{t=0}^T \pi_{\boldsymbol{w}}(a_t|s_t)p(s_{t+1}|s_t,a_t)

Therefore, only the policy depends on w\boldsymbol{w}. When taking the logarithm of this product, we get a sum where all the w\boldsymbol{w}-independent terms vanish. The final estimator samples trajectories under the distribution p(τ;w)p(\tau; \boldsymbol{w}) and computes:

wJ(w)1Ni=1N[G(τ(i))t=0Twlogπw(at(i)st(i))]\nabla_{\boldsymbol{w}}J(\boldsymbol{w}) \approx \frac{1}{N}\sum_{i=1}^N\left[G(\tau^{(i)})\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t^{(i)}|s_t^{(i)})\right]

This is a direct application of the score function estimator. However, we rarely use this form in practice and instead make several improvements to further reduce the variance.

Leveraging Conditional Independence

Given the Markov property of the MDP, rewards rkr_k for k<tk < t are conditionally independent of action ata_t given the history ht=(s0,a0,...,st1,at1,st)h_t = (s_0,a_0,...,s_{t-1},a_{t-1},s_t). This allows us to only need to consider future rewards when computing policy gradients.

wJ(w)=Eτ[t=0Twlogπw(atst)k=0Trk]=Eτ[t=0Twlogπw(atst)(k=0t1rk+k=tTrk)]=Eτ[t=0Twlogπw(atst)k=tTrk]\begin{align*} \nabla_{\boldsymbol{w}}J(\boldsymbol{w}) &= \mathbb{E}_{\tau}\left[\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\sum_{k=0}^T r_k\right] \\ &= \mathbb{E}_{\tau}\left[\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\left(\sum_{k=0}^{t-1} r_k + \sum_{k=t}^T r_k\right)\right] \\ &= \mathbb{E}_{\tau}\left[\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\sum_{k=t}^T r_k\right] \end{align*}

The conditional independence assumption means that the term Eτ[t=0Twlogπw(atst)k=0t1rk]\mathbb{E}_{\tau}\left[\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\sum_{k=0}^{t-1} r_k \right] vanishes. To see this, factor the trajectory distribution as:

p(τ)=p(s0,...,st,a0,...,at1)πw(atst)p(st+1,...,sT,at+1,...,aTst,at)p(\tau) = p(s_0,...,s_t,a_0,...,a_{t-1}) \pi_{\boldsymbol{w}}(a_t|s_t) p(s_{t+1},...,s_T,a_{t+1},...,a_T|s_t,a_t)

We can now re-write a single term of this summation as:

Eτ[wlogπw(atst)k=0t1rk]=Es0:t,a0:t1[k=0t1rkEat[wlogπw(atst)]]\mathbb{E}_{\tau}\left[\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\sum_{k=0}^{t-1} r_k\right] = \mathbb{E}_{s_{0:t},a_{0:t-1}}\left[\sum_{k=0}^{t-1} r_k \, \mathbb{E}_{a_t}\left[\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\right]\right]

The inner expectation is zero because

Eat[wlogπw(atst)]=wlogπw(atst)πw(atst)dat=wπw(atst)πw(atst)πw(atst)dat=wπw(atst)dat=wπw(atst)dat=w1=0\begin{align*} \mathbb{E}_{a_t}\left[\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\right] &= \int \nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\pi_{\boldsymbol{w}}(a_t|s_t)da_t \\ &= \int \frac{\nabla_{\boldsymbol{w}}\pi_{\boldsymbol{w}}(a_t|s_t)}{\pi_{\boldsymbol{w}}(a_t|s_t)}\pi_{\boldsymbol{w}}(a_t|s_t)da_t \\ &= \int \nabla_{\boldsymbol{w}}\pi_{\boldsymbol{w}}(a_t|s_t)da_t \\ &= \nabla_{\boldsymbol{w}}\int \pi_{\boldsymbol{w}}(a_t|s_t)da_t \\ &= \nabla_{\boldsymbol{w}}1 = 0 \end{align*}

The Monte Carlo estimator becomes:

wJ(w)1Ni=1N[t=0Twlogπw(at(i)st(i))k=tTrk(i)]\nabla_{\boldsymbol{w}}J(\boldsymbol{w}) \approx \frac{1}{N}\sum_{i=1}^N\left[\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t^{(i)}|s_t^{(i)})\sum_{k=t}^T r_k^{(i)}\right]

This gives us the REINFORCE algorithm:

The benefit of this estimator compared to the naive one (which would weight each score function by the full trajectory return G(τ)G(\tau)) is that it generally has less variance. This variance reduction arises from the conditional independence structure we exploited: past rewards do not depend on future actions. More formally, this estimator is an instance of a variance reduction technique known as the Extended Conditional Monte Carlo Method.

The Surrogate Loss Perspective

The algorithm above computes a gradient estimate g^\hat{g} explicitly. In practice, implementations using automatic differentiation frameworks take a different approach: they define a surrogate loss whose gradient matches the REINFORCE estimator. For a single trajectory, consider:

Lsurrogate(w)=t=0Tlogπw(atst)GtL_{\text{surrogate}}(\boldsymbol{w}) = -\sum_{t=0}^T \log \pi_{\boldsymbol{w}}(a_t|s_t) \, G_t

where the returns GtG_t and actions ata_t are treated as fixed constants (detached from the computation graph). Taking the gradient with respect to w\boldsymbol{w}:

wLsurrogate=t=0Twlogπw(atst)Gt\nabla_{\boldsymbol{w}} L_{\text{surrogate}} = -\sum_{t=0}^T \nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t) \, G_t

Minimizing this surrogate loss via gradient descent yields the same update as maximizing expected return via REINFORCE. The negative sign converts our maximization problem into a minimization suitable for standard optimizers.

This surrogate loss is not the expected return J(w)J(\boldsymbol{w}) we are trying to maximize. It is a computational device that produces the correct gradient at the current parameter values. Several properties distinguish it from a true loss function:

  1. It changes each iteration. The returns GtG_t come from trajectories sampled under the current policy. After updating w\boldsymbol{w}, we must collect new trajectories and construct a new surrogate loss.

  2. Its value is not meaningful. Unlike supervised learning where the loss measures prediction error, the numerical value of LsurrogateL_{\text{surrogate}} has no direct interpretation. Only its gradient matters.

  3. It is valid only locally. The surrogate loss provides the correct gradient only at the parameters used to collect the data. Moving far from those parameters invalidates the gradient estimate.

This perspective explains why policy gradient code often looks different from the pseudocode above. Instead of computing g^\hat{g} explicitly, implementations define the surrogate loss and call loss.backward():

# Surrogate loss implementation (single trajectory)
log_probs = [policy.log_prob(a_t, s_t) for s_t, a_t in trajectory]
returns = compute_returns(rewards)
surrogate_loss = -sum(lp * G for lp, G in zip(log_probs, returns))
surrogate_loss.backward()  # computes REINFORCE gradient
optimizer.step()

Variance Reduction via Control Variates

Recall that the REINFORCE gradient estimator, after leveraging conditional independence, takes the form:

wJ(w)1Ni=1N[t=0Twlogπw(at(i)st(i))k=tTrk(i)]\nabla_{\boldsymbol{w}}J(\boldsymbol{w}) \approx \frac{1}{N}\sum_{i=1}^N\left[\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t^{(i)}|s_t^{(i)})\sum_{k=t}^T r_k^{(i)}\right]

This is a sum over trajectories and timesteps. The gradient contribution at timestep tt of trajectory ii is:

wlogπw(at(i)st(i))k=tTrk(i)\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t^{(i)}|s_t^{(i)})\sum_{k=t}^T r_k^{(i)}

While unbiased, this estimator suffers from high variance because the return k=tTrk\sum_{k=t}^T r_k can vary significantly across trajectories even for the same state-action pair. The control variate method provides a principled way to reduce this variance.

General Control Variate Theory

For a general estimator ZZ of some quantity μ=E[Z]\mu = \mathbb{E}[Z], and a control variate CC with known expectation E[C]=0\mathbb{E}[C]=0, we can construct:

Zcv=ZαCZ_{\text{cv}} = Z - \alpha C

This remains unbiased since E[Zcv]=E[Z]αE[C]=E[Z]\mathbb{E}[Z_{\text{cv}}] = \mathbb{E}[Z] - \alpha\mathbb{E}[C] = \mathbb{E}[Z]. The variance is:

Var(Zcv)=Var(Z)+α2Var(C)2αCov(Z,C)\text{Var}(Z_{\text{cv}}) = \text{Var}(Z) + \alpha^2\text{Var}(C) - 2\alpha\text{Cov}(Z,C)

The 2αCov(Z,C)-2\alpha\text{Cov}(Z,C) term is what enables variance reduction. If ZZ and CC are positively correlated, we can choose α>0\alpha > 0 to make this term negative and large in magnitude, reducing the overall variance. However, the α2Var(C)\alpha^2\text{Var}(C) term grows quadratically with α\alpha, so if we make α\alpha too large, this quadratic term will eventually dominate and the variance will increase rather than decrease. The variance as a function of α\alpha is a parabola opening upward, with a unique minimum. Setting ddαVar(Zcv)=0\frac{d}{d\alpha}\text{Var}(Z_{\text{cv}}) = 0 gives:

α=Cov(Z,C)Var(C)\alpha^* = \frac{\text{Cov}(Z,C)}{\text{Var}(C)}

This is the coefficient from ordinary least squares regression: we predict the estimator ZZ using the control variate CC as the predictor. Since E[C]=0\mathbb{E}[C] = 0, the linear model is ZE[Z]+αCZ \approx \mathbb{E}[Z] + \alpha^* C, where α\alpha^* is the OLS slope coefficient. The control variate estimator Zcv=ZαCZ_{\text{cv}} = Z - \alpha^* C computes the residual: the part of ZZ that cannot be explained by CC.

Substituting α\alpha^* into the variance formula yields:

Var(Zcv)=Var(Z)[Cov(Z,C)]2Var(C)=(1R2)Var(Z)\text{Var}(Z_{\text{cv}}) = \text{Var}(Z) - \frac{[\text{Cov}(Z,C)]^2}{\text{Var}(C)} = (1 - R^2) \text{Var}(Z)

where R2=[Cov(Z,C)]2Var(Z)Var(C)R^2 = \frac{[\text{Cov}(Z,C)]^2}{\text{Var}(Z)\text{Var}(C)} is the coefficient of determination from regressing ZZ on CC. The variance reduction is R2Var(Z)R^2 \text{Var}(Z): the better CC predicts ZZ, the more variance we eliminate.

Application to REINFORCE

In the reinforcement learning setting, our REINFORCE gradient estimator is a sum over timesteps: t=0TZt\sum_{t=0}^T Z_t where each ZtZ_t represents the gradient contribution at timestep tt. We apply control variates separately to each term. Since Var(tZt)=tVar(Zt)+tsCov(Zt,Zs)\text{Var}(\sum_t Z_t) = \sum_t \text{Var}(Z_t) + \sum_{t \neq s} \text{Cov}(Z_t, Z_s), reducing the variance of each ZtZ_t reduces the total variance, though we do not explicitly address the cross-timestep covariance terms.

For a given trajectory at state sts_t, the gradient contribution at time tt is:

Zt=wlogπw(atst)k=tTrkZ_t = \nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\sum_{k=t}^T r_k

This is the product of the score function wlogπw(atst)\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t) and the return-to-go k=tTrk\sum_{k=t}^T r_k. We can subtract any state-dependent function b(st)b(s_t) from the return without introducing bias, as long as b(st)b(s_t) does not depend on ata_t. This is because:

Eatπw(st)[wlogπw(atst)b(st)]=b(st)Eat[wlogπw(atst)]=0\mathbb{E}_{a_t \sim \pi_{\boldsymbol{w}}(\cdot|s_t)}\left[\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)b(s_t)\right] = b(s_t)\mathbb{E}_{a_t}\left[\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\right] = 0

where the last equality follows from the score function identity (36).

We can now define our control variate as:

Ct=wlogπw(atst)b(st)C_t = \nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t) \cdot b(s_t)

where b(st)b(s_t) is a baseline function that depends only on the state. This satisfies E[Ctst]=0\mathbb{E}[C_t|s_t] = 0. Our control variate estimator becomes:

Zt,cv=ZtCt=wlogπw(atst)(k=tTrkb(st))Z_{t,\text{cv}} = Z_t - C_t = \nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\left(\sum_{k=t}^T r_k - b(s_t)\right)

The optimal baseline b(st)b^*(s_t) minimizes the variance. To find it, consider the scalar parameter case for simplicity. Write g(at)wlogπw(atst)g(a_t) \equiv \nabla_w \log \pi_w(a_t|s_t) and Gt=k=tTrkG_t = \sum_{k=t}^T r_k. We want to minimize:

b(st)=argminbVaratπw(st)[g(at)(Gtb)]b^*(s_t) = \arg\min_{b} \text{Var}_{a_t \sim \pi_{\boldsymbol{w}}(\cdot|s_t)}\left[g(a_t)(G_t - b)\right]

Since the mean does not depend on bb, minimizing the variance is equivalent to minimizing the second moment E[g(at)2(Gtb)2st]\mathbb{E}[g(a_t)^2(G_t - b)^2|s_t]. Expanding and taking the derivative with respect to bb gives:

b(st)=Eatst[g(at)2Gt]Eatst[g(at)2]b^*(s_t) = \frac{\mathbb{E}_{a_t|s_t}\left[g(a_t)^2 G_t\right]}{\mathbb{E}_{a_t|s_t}\left[g(a_t)^2\right]}

For vector-valued parameters w\boldsymbol{w}, we minimize a scalar proxy such as the trace of the covariance matrix, which yields the same formula with wlogπw(atst)2\|\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\|^2 in place of g(at)2g(a_t)^2:

b(st)=Eatst[wlogπw(atst)2Gt]Eatst[wlogπw(atst)2]b^*(s_t) = \frac{\mathbb{E}_{a_t|s_t}\left[\|\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\|^2 G_t\right]}{\mathbb{E}_{a_t|s_t}\left[\|\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\|^2\right]}

This is the exact optimal baseline: a weighted average of returns where the weights are the squared norms of the score function. In practice, we treat the squared norm as roughly constant across actions at a given state, which leads to the simpler and widely used choice:

b(st)E[Gtst]=vπw(st)b(s_t) \approx \mathbb{E}[G_t|s_t] = v^{\pi_{\boldsymbol{w}}}(s_t)

With this approximation, the variance-reduced gradient contribution at timestep tt becomes:

Zcv,t=wlogπw(atst)(k=tTrkvπw(st))Z_{\text{cv},t} = \nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\left(\sum_{k=t}^T r_k - v^{\pi_{\boldsymbol{w}}}(s_t)\right)

The term in parentheses is exactly the advantage function: Aπw(st,at)=qπw(st,at)vπw(st)A^{\pi_{\boldsymbol{w}}}(s_t, a_t) = q^{\pi_{\boldsymbol{w}}}(s_t, a_t) - v^{\pi_{\boldsymbol{w}}}(s_t), where the Q-function is approximated by the Monte Carlo return k=tTrk\sum_{k=t}^T r_k. The full gradient estimate for a trajectory is then the sum over all timesteps:

g^=t=0TZcv,t=t=0Twlogπw(atst)(Gtvπw(st))\hat{g} = \sum_{t=0}^T Z_{\text{cv},t} = \sum_{t=0}^T \nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\left(G_t - v^{\pi_{\boldsymbol{w}}}(s_t)\right)

In practice, we do not have access to the true value function and must learn it. Unlike the methods in the amortization chapter, where we learned value functions to approximate the optimal Q-function, here our goal is policy evaluation: estimating the value of the current policy πw\pi_{\boldsymbol{w}}. The same function approximation techniques apply, but we target vπwv^{\pi_{\boldsymbol{w}}} rather than vv^*. The simplest approach is to regress from states to Monte Carlo returns, learning what Williams (1992) called a “baseline”:

When implementing this algorithm nowadays, we always use mini-batching to make full use of our GPUs. Therefore, a more representative variant for this algorithm would be:

The value function is trained by regressing states directly to their sampled Monte Carlo returns GG. Advantage normalization (step 2.5) is not part of the optimal baseline derivation but improves optimization in practice and is standard in modern implementations.

Generalized Advantage Estimation

The baseline construction gave us a gradient estimator of the form:

wJ(w)1Ni=1Nt=0Twlogπw(at(i)st(i))(Gt(i)v(st(i)))\nabla_{\boldsymbol{w}}J(\boldsymbol{w}) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=0}^T \nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t^{(i)}|s_t^{(i)}) \left(G_t^{(i)} - v(s_t^{(i)})\right)

where Gt=k=tTrkG_t = \sum_{k=t}^T r_k is the Monte Carlo return from time tt. For each visited state-action pair (st,at)(s_t, a_t), the term in parentheses

A^tMC=Gtv(st)\widehat{A}_t^{\text{MC}} = G_t - v(s_t)

is a Monte Carlo estimate of the advantage Aπ(st,at)=qπ(st,at)vπ(st)A^{\pi}(s_t, a_t) = q^{\pi}(s_t, a_t) - v^{\pi}(s_t). If the baseline equals the true value function, v=vπv = v^{\pi}, then E[A^tMCst,at]=Aπ(st,at)\mathbb{E}[\widehat{A}_t^{\text{MC}} | s_t, a_t] = A^{\pi}(s_t, a_t), so this estimator is unbiased.

However, as an estimator it has two limitations. First, it has high variance because GtG_t depends on all future rewards. Second, it uses the value function only as a baseline, not as a predictor of long-term returns. We essentially discard the information in v(st+1),v(st+2),v(s_{t+1}), v(s_{t+2}), \ldots

GAE addresses these issues by constructing a family of estimators that interpolate between pure Monte Carlo and pure bootstrapping. A parameter λ\lambda controls the bias-variance tradeoff.

Decomposing the Monte Carlo Advantage

Fix a value function v(s)v(s) (not necessarily equal to vπv^{\pi}) and define the one-step residual:

δt=rt+γv(st+1)v(st)\delta_t = r_t + \gamma v(s_{t+1}) - v(s_t)

Start from the Monte Carlo advantage and add and subtract γv(st+1)\gamma v(s_{t+1}):

Gtv(st)=rt+γGt+1v(st)=rt+γv(st+1)v(st)+γ(Gt+1v(st+1))=δt+γ(Gt+1v(st+1))\begin{align*} G_t - v(s_t) &= r_t + \gamma G_{t+1} - v(s_t) \\ &= r_t + \gamma v(s_{t+1}) - v(s_t) + \gamma(G_{t+1} - v(s_{t+1})) \\ &= \delta_t + \gamma(G_{t+1} - v(s_{t+1})) \end{align*}

Applying this decomposition recursively yields:

Gtv(st)=l=0Ttγlδt+lG_t - v(s_t) = \sum_{l=0}^{T-t} \gamma^l \delta_{t+l}

The Monte Carlo advantage is exactly the discounted sum of future residuals. This is an algebraic identity, not an approximation.

The sequence {δt+l}l0\{\delta_{t+l}\}_{l \geq 0} provides incremental corrections to the value function as we move forward in time. The term δt\delta_t depends only on (st,at,st+1)(s_t, a_t, s_{t+1}); δt+1\delta_{t+1} depends on (st+1,at+1,st+2)(s_{t+1}, a_{t+1}, s_{t+2}), and so on. As ll increases, the corrections become more noisy (they depend on more random outcomes) and more sensitive to errors in the value function at later states. Although the full sum is unbiased when v=vπv = v^{\pi}, it can have high variance and can be badly affected by approximation error in vv.

GAE as a Shrinkage Estimator

The decomposition above suggests a family of estimators that downweight residuals farther in the future. Let λ[0,1]\lambda \in [0,1] and define:

Atλ=l=0Tt(γλ)lδt+lA_t^{\lambda} = \sum_{l=0}^{T-t} (\gamma\lambda)^l \delta_{t+l}

This is the generalized advantage estimator AtGAE(γ,λ)A_t^{\text{GAE}(\gamma,\lambda)}.

Two special cases illustrate the extremes. When λ=1\lambda = 1, we recover the Monte Carlo advantage:

Atλ=1=l=0Ttγlδt+l=Gtv(st)A_t^{\lambda=1} = \sum_{l=0}^{T-t} \gamma^l \delta_{t+l} = G_t - v(s_t)

When λ=0\lambda = 0, we keep only the immediate residual:

Atλ=0=δt=rt+γv(st+1)v(st)A_t^{\lambda=0} = \delta_t = r_t + \gamma v(s_{t+1}) - v(s_t)

Intermediate values 0<λ<10 < \lambda < 1 interpolate between these extremes. The influence of δt+l\delta_{t+l} decays geometrically as (γλ)l(\gamma\lambda)^l. The parameter λ\lambda acts as a shrinkage parameter: small λ\lambda shrinks the estimator toward the one-step residual; large λ\lambda allows the estimator to behave more like the Monte Carlo advantage.

If v=vπv = v^{\pi} is the true value function, then E[δtst,at]=Aπ(st,at)\mathbb{E}[\delta_t | s_t, a_t] = A^{\pi}(s_t, a_t) and E[δt+lst,at]=0\mathbb{E}[\delta_{t+l} | s_t, a_t] = 0 for l1l \geq 1. In this case:

E[Atλst,at]=l=0Tt(γλ)lE[δt+lst,at]=Aπ(st,at)\mathbb{E}[A_t^{\lambda} | s_t, a_t] = \sum_{l=0}^{T-t} (\gamma\lambda)^l \mathbb{E}[\delta_{t+l} | s_t, a_t] = A^{\pi}(s_t, a_t)

for all λ[0,1]\lambda \in [0,1]. When the value function is exact, GAE is unbiased regardless of λ\lambda; changing λ\lambda only affects variance.

In practice, we approximate vπv^{\pi} with a function approximator, and the residuals δt+l\delta_{t+l} inherit approximation error. Distant residuals involve multiple applications of the approximate value function and are more contaminated by modeling error. Downweighting them (choosing λ<1\lambda < 1) introduces bias but can reduce variance and limit the impact of those errors.

Mixture of Multi-Step Estimators

Another perspective on GAE comes from multi-step returns. Define the kk-step return from time tt:

Gt(k)=l=0k1γlrt+l+γkv(st+k)G_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l r_{t+l} + \gamma^k v(s_{t+k})

and the corresponding kk-step advantage estimator At(k)=Gt(k)v(st)A_t^{(k)} = G_t^{(k)} - v(s_t). Each At(k)A_t^{(k)} uses kk rewards before bootstrapping; larger kk means more variance but less bootstrapping error.

The GAE estimator can be written as a geometric mixture:

Atλ=(1λ)k=1Ttλk1At(k)A_t^{\lambda} = (1-\lambda) \sum_{k=1}^{T-t} \lambda^{k-1} A_t^{(k)}

GAE is a weighted average of the kk-step advantage estimators, with shorter horizons weighted more heavily when λ\lambda is small.

Using GAE in the Policy Gradient

Once we choose λ\lambda, we plug AtλA_t^{\lambda} in place of Gtv(st)G_t - v(s_t) in the policy gradient estimator:

wJ(w)1Ni=1Nt=0Twlogπw(at(i)st(i))Atλ,(i)\nabla_{\boldsymbol{w}}J(\boldsymbol{w}) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=0}^T \nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t^{(i)}|s_t^{(i)}) A_t^{\lambda,(i)}

We still use a control variate to reduce variance (the baseline vv), but now we construct the advantage target by smoothing the sequence of residuals {δt}\{\delta_t\} with a geometrically decaying kernel.

For the value function, it is convenient to define the λ\lambda-return:

Gtλ=Atλ+v(st)G_t^{\lambda} = A_t^{\lambda} + v(s_t)

When λ=1\lambda = 1, GtλG_t^{\lambda} reduces to the Monte Carlo return; when λ=0\lambda = 0, it becomes the one-step bootstrapped target rt+γv(st+1)r_t + \gamma v(s_{t+1}).

When λ=1\lambda = 1, this reduces (up to advantage normalization) to the Monte Carlo baseline algorithm earlier in the chapter. When λ=0\lambda = 0, advantages become the one-step residuals δt\delta_t, and the λ\lambda-returns reduce to standard one-step bootstrapped targets.

Actor-Critic as the λ=0\lambda = 0 Limit

The case λ=0\lambda = 0 is particularly simple. The advantage becomes:

Atλ=0=δt=rt+γv(st+1)v(st)A_t^{\lambda=0} = \delta_t = r_t + \gamma v(s_{t+1}) - v(s_t)

and the policy update reduces to:

ww+αwwlogπw(atst)δt\boldsymbol{w} \leftarrow \boldsymbol{w} + \alpha_w \nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t) \delta_t

while the value update becomes a standard one-step regression toward rt+γv(st+1)r_t + \gamma v(s_{t+1}). This gives the online actor-critic algorithm:

This algorithm was derived by Sutton in his 1984 thesis as an “adaptive heuristic” for temporal credit assignment. In the language of this chapter, it is the λ=0\lambda = 0 member of the GAE family: it uses the most local residual δt\delta_t as both the target for the value function and the advantage estimate for the policy gradient.

Likelihood Ratio Methods in Reinforcement Learning

The score function estimator from the previous section is a special case of the likelihood ratio method where the proposal distribution equals the target distribution. We now consider the general case where they differ.

Recall the likelihood ratio gradient estimator from the beginning of this chapter. For objective J(θ)=Exp(x;θ)[f(x)]J(\theta) = \mathbb{E}_{x \sim p(x;\theta)}[f(x)] and any proposal distribution q(x)q(x):

θJ(θ)=Exq(x)[f(x)ρ(x;q,θ)θlogp(x;θ)]\nabla_\theta J(\theta) = \mathbb{E}_{x \sim q(x)}\left[f(x) \rho(x; q, \theta) \nabla_\theta \log p(x;\theta)\right]

where ρ(x;q,θ)=p(x;θ)q(x)\rho(x; q, \theta) = \frac{p(x;\theta)}{q(x)} is the likelihood ratio. The partial derivative ρθ=ρθlogp\frac{\partial \rho}{\partial \theta} = \rho \nabla_\theta \log p holds because xx is treated as fixed, having been sampled from qq, which does not depend on θ\theta.

In reinforcement learning, let x=τx = \tau be a trajectory, f(τ)=G(τ)f(\tau) = G(\tau) the return, p(τ;w)p(\tau;\boldsymbol{w}) the trajectory distribution under policy πw\pi_{\boldsymbol{w}}, and q(τ)q(\tau) the trajectory distribution under some other policy πq\pi_q. The gradient becomes:

wJ(w)=Eτπq[G(τ)ρ(τ)t=0Twlogπw(atst)]\nabla_{\boldsymbol{w}} J(\boldsymbol{w}) = \mathbb{E}_{\tau \sim \pi_q}\left[G(\tau) \rho(\tau) \sum_{t=0}^T \nabla_{\boldsymbol{w}} \log \pi_{\boldsymbol{w}}(a_t|s_t)\right]

where the trajectory likelihood ratio simplifies because transition probabilities cancel:

ρ(τ)=p(τ;w)q(τ)=t=0Tπw(atst)πq(atst)=t=0Tρt\rho(\tau) = \frac{p(\tau;\boldsymbol{w})}{q(\tau)} = \prod_{t=0}^T \frac{\pi_{\boldsymbol{w}}(a_t|s_t)}{\pi_q(a_t|s_t)} = \prod_{t=0}^T \rho_t

This product of T+1T+1 ratios can become extremely large or small as TT grows, leading to high variance. The temporal structure helps: since past rewards rkr_k for k<tk < t do not depend on action ata_t, we can reduce the full trajectory weight to a per-step weight ρt\rho_t. Combined with a baseline b(st)b(s_t), the practical estimator is:

wJ(w)=Eτπq[t=0Tρtwlogπw(atst)(Gtb(st))]\nabla_{\boldsymbol{w}} J(\boldsymbol{w}) = \mathbb{E}_{\tau \sim \pi_q}\left[\sum_{t=0}^T \rho_t \nabla_{\boldsymbol{w}} \log \pi_{\boldsymbol{w}}(a_t|s_t) (G_t - b(s_t))\right]

This is the gradient of the surrogate objective:

L(w)=Eτπq[t=0TρtAt]L(\boldsymbol{w}) = \mathbb{E}_{\tau \sim \pi_q}\left[\sum_{t=0}^T \rho_t A_t\right]

where At=Gtb(st)A_t = G_t - b(s_t). When πq=πw\pi_q = \pi_{\boldsymbol{w}}, the ratios ρt=1\rho_t = 1 and we recover the score function estimator.

Variance and the Dominance Condition

The ratio ρt=πw(atst)/πq(atst)\rho_t = \pi_{\boldsymbol{w}}(a_t|s_t)/\pi_q(a_t|s_t) is well-behaved only when the two policies are similar. If πw\pi_{\boldsymbol{w}} assigns high probability to an action where πq\pi_q assigns low probability, the ratio explodes. For example, if πq(as)=0.01\pi_q(a|s) = 0.01 and πw(as)=0.5\pi_{\boldsymbol{w}}(a|s) = 0.5, then ρ=50\rho = 50, amplifying any noise in the advantage estimate.

Importance sampling also requires the dominance condition: the support of πw\pi_{\boldsymbol{w}} must be contained in the support of πq\pi_q. If πw(as)>0\pi_{\boldsymbol{w}}(a|s) > 0 but πq(as)=0\pi_q(a|s) = 0, the ratio is undefined. Stochastic policies typically have full support, but the ratio can still become arbitrarily large as πq(as)0\pi_q(a|s) \to 0.

A common use case is to set πq=πwold\pi_q = \pi_{\boldsymbol{w}_{\text{old}}}, a previous version of the policy. This allows reusing data across multiple gradient steps: collect trajectories once, then update w\boldsymbol{w} several times. But each update moves w\boldsymbol{w} further from wold\boldsymbol{w}_{\text{old}}, making the ratios more extreme. Eventually, the gradient signal is dominated by a few samples with large weights.

Proximal Policy Optimization

The solution is to ensure the ratio ρt\rho_t stays close to 1, keeping the new policy close to the behavior policy. Trust Region Policy Optimization (TRPO) achieves this by constraining the KL divergence:

maxwLIS(w)subject toEs[DKL(πwold(s)πw(s))]δ\max_{\boldsymbol{w}} L^{\text{IS}}(\boldsymbol{w}) \quad \text{subject to} \quad \mathbb{E}_s\left[D_{\text{KL}}(\pi_{\boldsymbol{w}_{\text{old}}}(\cdot|s) \| \pi_{\boldsymbol{w}}(\cdot|s))\right] \leq \delta

The KL constraint ensures that the two distributions remain similar, which bounds how extreme the importance weights can become. This is a constrained optimization problem, and one could in principle apply standard methods such as projected gradient descent or augmented Lagrangian approaches (as discussed in the trajectory optimization chapter). TRPO takes a different approach: it uses a second-order Taylor approximation of the KL constraint around the current parameters and solves the resulting trust region subproblem using conjugate gradient methods. This involves computing the Fisher information matrix (the Hessian of the KL divergence), which adds computational overhead.

Proximal Policy Optimization (PPO) achieves similar behavior through a simpler mechanism: rather than constraining the distributions to be similar, it directly clips the ratio ρt\rho_t to prevent it from moving too far from 1. This is a construction-level guarantee rather than an optimization-level constraint. The clipped surrogate objective is:

LCLIP(w)=Es,aπwold[min(ρt(w)At,clip(ρt(w),1ϵ,1+ϵ)At)]L^{\text{CLIP}}(\boldsymbol{w}) = \mathbb{E}_{s,a \sim \pi_{\boldsymbol{w}_{\text{old}}}}\left[\min\left(\rho_t(\boldsymbol{w}) A_t, \text{clip}(\rho_t(\boldsymbol{w}), 1-\epsilon, 1+\epsilon) A_t\right)\right]

where ϵ\epsilon is a hyperparameter (typically 0.1 or 0.2) and clip(x,a,b)=max(a,min(x,b))\text{clip}(x, a, b) = \max(a, \min(x, b)) restricts xx to the interval [a,b][a, b].

The min\min operator selects the more pessimistic estimate. Consider the two cases:

In both cases, the clipping removes the incentive to move the probability ratio beyond the interval [1ϵ,1+ϵ][1-\epsilon, 1+\epsilon]. This keeps the new policy close to the old policy without explicitly computing or constraining the KL divergence.

The algorithm collects a batch of trajectories, then performs KK epochs of mini-batch updates on the same data. The clipped objective ensures that even after multiple updates, the policy does not move too far from the policy that collected the data. The ratio ρt\rho_t is computed in log-space for numerical stability.

PPO has become one of the most widely used policy gradient algorithms due to its simplicity and robustness. Compared to TRPO, it avoids the computational overhead of constrained optimization while achieving similar sample efficiency. The clip parameter ϵ\epsilon is the main hyperparameter controlling the trust region size: smaller values keep the policy closer to the behavior policy but may slow learning, while larger values allow faster updates but risk instability.

The Policy Gradient Theorem

Sutton et al. (1999) provided an expression for the gradient of the infinite discounted return with respect to the parameters of a parameterized policy. Here is an alternative derivation by considering a bilevel optimization problem:

maxwαvγπw\max_{\mathbf{w}} \alpha^\top \mathbf{v}_\gamma^{\pi_{\boldsymbol{w}}}

subject to:

(IγPπw)vγπw=rπw(\mathbf{I} - \gamma \mathbf{P}_{\pi_{\boldsymbol{w}}}) \mathbf{v}_\gamma^{\pi_{\boldsymbol{w}}} = \mathbf{r}_{\pi_{\boldsymbol{w}}}

The Implicit Function Theorem states that if there is a solution to the problem F(v,w)=0F(\mathbf{v}, \mathbf{w}) = 0, then we can “reparameterize” our problem as F(v(w),w)F(\mathbf{v}(\mathbf{w}), \mathbf{w}) where v(w)\mathbf{v}(\mathbf{w}) is an implicit function of w\mathbf{w}. If the Jacobian Fv\frac{\partial F}{\partial \mathbf{v}} is invertible, then:

dv(w)dw=(F(v(w),w)x)1F(v(w),w)w\frac{d\mathbf{v}(\mathbf{w})}{d\mathbf{w}} = -\left(\frac{\partial F(\mathbf{v}(\mathbf{w}), \mathbf{w})}{\partial \mathbf{x}}\right)^{-1}\frac{\partial F(\mathbf{v}(\mathbf{w}), \mathbf{w})}{\partial \mathbf{w}}

Here we made it clear in our notation that the derivative must be evaluated at root (v(w),w)(\mathbf{v}(\mathbf{w}), \mathbf{w}) of FF. For the remaining of this derivation, we will drop this dependence to make notation more compact.

Applying this to our case with F(v,w)=(IγPπw)vrπwF(\mathbf{v}, \mathbf{w}) = (\mathbf{I} - \gamma \mathbf{P}_{\pi_{\boldsymbol{w}}})\mathbf{v} - \mathbf{r}_{\pi_{\boldsymbol{w}}}:

vγπww=(IγPπw)1(rπww+γPπwwvγπw)\frac{\partial \mathbf{v}_\gamma^{\pi_{\boldsymbol{w}}}}{\partial \mathbf{w}} = (\mathbf{I} - \gamma \mathbf{P}_{\pi_{\boldsymbol{w}}})^{-1}\left(\frac{\partial \mathbf{r}_{\pi_{\boldsymbol{w}}}}{\partial \mathbf{w}} + \gamma \frac{\partial \mathbf{P}_{\pi_{\boldsymbol{w}}}}{\partial \mathbf{w}}\mathbf{v}_\gamma^{\pi_{\boldsymbol{w}}}\right)

Then:

wJ(w)=αvγπww=xα(rπww+γPπwwvγπw)\begin{align*} \nabla_{\mathbf{w}}J(\mathbf{w}) &= \alpha^\top \frac{\partial \mathbf{v}_\gamma^{\pi_{\boldsymbol{w}}}}{\partial \mathbf{w}} \\ &= \mathbf{x}_\alpha^\top\left(\frac{\partial \mathbf{r}_{\pi_{\boldsymbol{w}}}}{\partial \mathbf{w}} + \gamma \frac{\partial \mathbf{P}_{\pi_{\boldsymbol{w}}}}{\partial \mathbf{w}}\mathbf{v}_\gamma^{\pi_{\boldsymbol{w}}}\right) \end{align*}

where we have defined the discounted state visitation distribution:

xαα(IγPπw)1.\mathbf{x}_\alpha^\top \equiv \alpha^\top(\mathbf{I} - \gamma \mathbf{P}_{\pi_{\boldsymbol{w}}})^{-1}.

Remember the vector notation for MDPs:

[rπ]s=aπ(as)r(s,a)[Pπ]ss=aπ(as)P(ss,a)\begin{align*} [\mathbf{r}_{\pi}]_s &= \sum_a \pi(a|s)r(s,a) \\ [\mathbf{P}_{\pi}]_{ss'} &= \sum_a \pi(a|s)P(s'|s,a) \end{align*}

Then taking the derivatives gives us:

[rπww]s=awπw(as)r(s,a)[Pπwwvγπw]s=awπw(as)sP(ss,a)vγπw(s)\begin{align*} \left[\frac{\partial \mathbf{r}_{\pi_{\boldsymbol{w}}}}{\partial \mathbf{w}}\right]_s &= \sum_a \nabla_{\mathbf{w}}\pi_{\boldsymbol{w}}(a|s)r(s,a) \\ \left[\frac{\partial \mathbf{P}_{\pi_{\boldsymbol{w}}}}{\partial \mathbf{w}}\mathbf{v}_\gamma^{\pi_{\boldsymbol{w}}}\right]_s &= \sum_a \nabla_{\mathbf{w}}\pi_{\boldsymbol{w}}(a|s)\sum_{s'} P(s'|s,a)v_\gamma^{\pi_{\boldsymbol{w}}}(s') \end{align*}

Substituting back:

wJ(w)=sxα(s)(awπw(as)r(s,a)+γawπw(as)sP(ss,a)vγπw(s))=sxα(s)awπw(as)(r(s,a)+γsP(ss,a)vγπw(s))\begin{align*} \nabla_{\mathbf{w}}J(\mathbf{w}) &= \sum_s x_\alpha(s)\left(\sum_a \nabla_{\mathbf{w}}\pi_{\boldsymbol{w}}(a|s)r(s,a) + \gamma\sum_a \nabla_{\mathbf{w}}\pi_{\boldsymbol{w}}(a|s)\sum_{s'} P(s'|s,a)v_\gamma^{\pi_{\boldsymbol{w}}}(s')\right) \\ &= \sum_s x_\alpha(s)\sum_a \nabla_{\mathbf{w}}\pi_{\boldsymbol{w}}(a|s)\left(r(s,a) + \gamma \sum_{s'} P(s'|s,a)v_\gamma^{\pi_{\boldsymbol{w}}}(s')\right) \end{align*}

This is the policy gradient theorem, where xα(s)x_\alpha(s) is the discounted state visitation distribution and the term in parentheses is the state-action value function qπw(s,a)q^{\pi_{\boldsymbol{w}}}(s,a).

Normalized Discounted State Visitation Distribution

The discounted state visitation xα(s)x_\alpha(s) is not normalized. Therefore the expression we obtained above is not an expectation. However, we can transform it into one by normalizing by 1γ1 - \gamma. Note that for any initial distribution α\alpha:

sxα(s)=α(IγPπw)11=α11γ=11γ\sum_s x_\alpha(s) = \alpha^\top(\mathbf{I} - \gamma \mathbf{P}_{\pi_{\boldsymbol{w}}})^{-1}\mathbf{1} = \frac{\alpha^\top\mathbf{1}}{1-\gamma} = \frac{1}{1-\gamma}

Therefore, defining the normalized state distribution μα(s)=(1γ)xα(s)\mu_\alpha(s) = (1-\gamma)x_\alpha(s), we can write:

wJ(w)=11γsμα(s)awπw(as)(r(s,a)+γsP(ss,a)vγπw(s))=11γEsμα[awπw(as)qπw(s,a)]\begin{align*} \nabla_{\mathbf{w}}J(\mathbf{w}) &= \frac{1}{1-\gamma}\sum_s \mu_\alpha(s)\sum_a \nabla_{\mathbf{w}}\pi_{\boldsymbol{w}}(a|s)\left(r(s,a) + \gamma \sum_{s'} P(s'|s,a)v_\gamma^{\pi_{\boldsymbol{w}}}(s')\right) \\ &= \frac{1}{1-\gamma}\mathbb{E}_{s\sim\mu_\alpha}\left[\sum_a \nabla_{\mathbf{w}}\pi_{\boldsymbol{w}}(a|s)q^{\pi_{\boldsymbol{w}}}(s,a)\right] \end{align*}

Now we have expressed the policy gradient theorem in terms of expectations under the normalized discounted state visitation distribution. But what does sampling from μα\mu_\alpha mean? Recall that xα=α(IγPπw)1\mathbf{x}_\alpha^\top = \alpha^\top(\mathbf{I} - \gamma \mathbf{P}_{\pi_{\boldsymbol{w}}})^{-1}. Using the Neumann series expansion (valid when γPπw<1\|\gamma \mathbf{P}_{\pi_{\boldsymbol{w}}}\| < 1, which holds for γ<1\gamma < 1 since Pπw\mathbf{P}_{\pi_{\boldsymbol{w}}} is a stochastic matrix) we have:

μα=(1γ)αk=0(γPπw)k\boldsymbol{\mu}_\alpha^\top = (1-\gamma)\alpha^\top\sum_{k=0}^{\infty} (\gamma \mathbf{P}_{\pi_{\boldsymbol{w}}})^k

We can then factor out the first term from this summation to obtain:

μα=(1γ)αk=0(γPπw)k=(1γ)α+(1γ)αk=1(γPπw)k=(1γ)α+(1γ)αγPπwk=0(γPπw)k=(1γ)α+γμαPπw\begin{align*} \boldsymbol{\mu}_\alpha^\top &= (1-\gamma)\alpha^\top\sum_{k=0}^{\infty} (\gamma \mathbf{P}_{\pi_{\boldsymbol{w}}})^k \\ &= (1-\gamma)\alpha^\top + (1-\gamma)\alpha^\top\sum_{k=1}^{\infty} (\gamma \mathbf{P}_{\pi_{\boldsymbol{w}}})^k \\ &= (1-\gamma)\alpha^\top + (1-\gamma)\alpha^\top\gamma\mathbf{P}_{\pi_{\boldsymbol{w}}}\sum_{k=0}^{\infty} (\gamma \mathbf{P}_{\pi_{\boldsymbol{w}}})^k \\ &= (1-\gamma)\alpha^\top + \gamma\boldsymbol{\mu}_\alpha^\top \mathbf{P}_{\pi_{\boldsymbol{w}}} \end{align*}

The balance equation:

μα=(1γ)α+γμαPπw\boldsymbol{\mu}_\alpha^\top = (1-\gamma)\alpha^\top + \gamma\boldsymbol{\mu}_\alpha^\top \mathbf{P}_{\pi_{\boldsymbol{w}}}

shows that μα\boldsymbol{\mu}_\alpha is a mixture distribution: with probability 1γ1-\gamma you draw a state from the initial distribution α\alpha (reset), and with probability γ\gamma you follow the policy dynamics Pπw\mathbf{P}_{\pi_{\boldsymbol{w}}} from the current state (continue). This interpretation directly connects to the geometric process: at each step you either terminate and resample from α\alpha (with probability 1γ1-\gamma) or continue following the policy (with probability γ\gamma).

import numpy as np

def sample_from_discounted_visitation(
    alpha, 
    policy, 
    transition_model, 
    gamma, 
    n_samples=1000
):
    """Sample states from the discounted visitation distribution.
    
    Args:
        alpha: Initial state distribution (vector of probabilities)
        policy: Function (state -> action probabilities)
        transition_model: Function (state, action -> next state probabilities)
        gamma: Discount factor
        n_samples: Number of states to sample
    
    Returns:
        Array of sampled states
    """
    samples = []
    n_states = len(alpha)
    
    # Initialize state from alpha
    current_state = np.random.choice(n_states, p=alpha)
    
    for _ in range(n_samples):
        samples.append(current_state)
        
        # With probability (1-gamma): reset
        if np.random.random() > gamma:
            current_state = np.random.choice(n_states, p=alpha)
        # With probability gamma: continue
        else:
            # Sample action from policy
            action_probs = policy(current_state)
            action = np.random.choice(len(action_probs), p=action_probs)
            
            # Sample next state from transition model
            next_state_probs = transition_model(current_state, action)
            current_state = np.random.choice(n_states, p=next_state_probs)
    
    return np.array(samples)

# Example usage for a simple 2-state MDP
alpha = np.array([0.7, 0.3])  # Initial distribution
policy = lambda s: np.array([0.8, 0.2])  # Dummy policy
transition_model = lambda s, a: np.array([0.9, 0.1])  # Dummy transitions
gamma = 0.9

samples = sample_from_discounted_visitation(alpha, policy, transition_model, gamma)

# Check empirical distribution
print("Empirical state distribution:")
print(np.bincount(samples) / len(samples))
Empirical state distribution:
[0.87 0.13]

While the math shows that sampling from the discounted visitation distribution μα\boldsymbol{\mu}_\alpha would give us unbiased policy gradient estimates, Thomas (2014) demonstrated that this implementation can be detrimental to performance in practice. The issue arises because terminating trajectories early (with probability 1γ1-\gamma) reduces the effective amount of data we collect from each trajectory. This early termination weakens the learning signal, as many trajectories don’t reach meaningful terminal states or rewards.

Therefore, in practice, we typically sample complete trajectories from the undiscounted process (running the policy until natural termination or a fixed horizon) while still using γ\gamma in the advantage estimation. This approach preserves the full learning signal from each trajectory and has been empirically shown to lead to better performance.

This is one of several cases in RL where the theoretically optimal procedure differs from the best practical implementation.

The Actor-Critic Architecture

The policy gradient theorem shows that the gradient depends on the action-value function qπw(s,a)q^{\pi_{\boldsymbol{w}}}(s,a). In practice, we do not have access to the true qq-function and must estimate it. This leads to the actor-critic architecture: the actor maintains the policy πw\pi_{\boldsymbol{w}}, while the critic maintains an estimate of the value function.

This architecture traces back to Sutton’s 1984 thesis, where he proposed the Adaptive Heuristic Critic. The actor uses the critic’s value estimates to compute advantage estimates for the policy gradient, while the critic learns from the same trajectories generated by the actor. The algorithms we developed earlier (REINFORCE with baseline, GAE, and the one-step actor-critic) are all instances of this architecture.

The key challenge is that we are simultaneously learning two functions that depend on each other. The actor’s gradient uses the critic’s estimates, but the critic is trained on data generated by the actor’s policy. If both change too quickly, the learning process can become unstable.

Konda (2002) analyzed this coupled learning problem and established convergence guarantees under a two-timescale condition: the critic must update faster than the actor. Intuitively, the critic needs to “track” the current policy’s value function before the actor uses those estimates to update. If the actor moves too fast, it uses stale or inaccurate value estimates, leading to poor gradient estimates.

In practice, this is implemented by using different learning rates: a larger learning rate αθ\alpha_\theta for the critic and a smaller learning rate αw\alpha_w for the actor, with αθ>αw\alpha_\theta > \alpha_w. Alternatively, one can perform multiple critic updates per actor update. The soft actor-critic algorithm discussed earlier in the amortization chapter follows this same principle, inheriting the actor-critic structure while incorporating entropy regularization and learning Q-functions directly.

The actor-critic architecture also connects to the bilevel optimization perspective of the policy gradient theorem: the outer problem optimizes the policy, while the inner problem solves for the value function given that policy. The two-timescale condition ensures that the inner problem is approximately solved before taking a step on the outer problem.

Reparameterization Methods in Reinforcement Learning

When dynamics are known or can be learned, reparameterization provides an alternative to score function methods. By expressing actions and state transitions as deterministic functions of noise, we can backpropagate through trajectories to compute policy gradients with lower variance than score function estimators.

Stochastic Value Gradients

The reparameterization trick requires that we can express our random variable as a deterministic function of noise. In reinforcement learning, this applies naturally when we have a learned model of the dynamics. Consider a stochastic policy πw(as)\pi_{\boldsymbol{w}}(a|s) that we can reparameterize as a=πw(s,ϵ)a = \pi_{\boldsymbol{w}}(s,\epsilon) where ϵp(ϵ)\epsilon \sim p(\epsilon), and a dynamics model s=f(s,a,ξ)s' = f(s,a,\xi) where ξp(ξ)\xi \sim p(\xi) represents environment stochasticity. Both transformations are deterministic given the noise variables.

With these reparameterizations, we can write an nn-step return as a differentiable function of the noise:

Rn(s0,{ϵi},{ξi})=i=0n1γir(si,ai)R_n(s_0,\{\epsilon_i\},\{\xi_i\}) = \sum_{i=0}^{n-1} \gamma^i r(s_i,a_i)

where ai=πw(si,ϵi)a_i = \pi_{\boldsymbol{w}}(s_i,\epsilon_i) and si+1=f(si,ai,ξi)s_{i+1} = f(s_i,a_i,\xi_i) for i=0,...,n1i=0,...,n-1. The objective becomes:

J(w)=E{ϵi},{ξi}[Rn(s0,{ϵi},{ξi})]J(\boldsymbol{w}) = \mathbb{E}_{\{\epsilon_i\},\{\xi_i\}}[R_n(s_0,\{\epsilon_i\},\{\xi_i\})]

We can now apply the reparameterization gradient estimator:

wJ(w)=E{ϵi},{ξi}[wRn(s0,{ϵi},{ξi})]\nabla_{\boldsymbol{w}}J(\boldsymbol{w}) = \mathbb{E}_{\{\epsilon_i\},\{\xi_i\}}\left[\nabla_{\boldsymbol{w}}R_n(s_0,\{\epsilon_i\},\{\xi_i\})\right]

This gradient can be computed by automatic differentiation through the sequence of policy and model evaluations. The computation requires backpropagating through nn steps of model rollouts, which becomes expensive for large nn but avoids the high variance of score function estimators.

The Stochastic Value Gradients (SVG) framework Heess et al., 2015 uses this approach while introducing a hybrid objective that combines model rollouts with value function bootstrapping:

JSVG(n)(w)=E{ϵi},{ξi}[i=0n1γir(si,ai)+γnq(sn,an;θ)]J^{\text{SVG}(n)}(\boldsymbol{w}) = \mathbb{E}_{\{\epsilon_i\},\{\xi_i\}}\left[\sum_{i=0}^{n-1} \gamma^i r(s_i,a_i) + \gamma^n q(s_n,a_n;\theta)\right]

The terminal value function q(sn,an;θ)q(s_n,a_n;\theta) approximates the value beyond horizon nn, allowing shorter rollouts while still capturing long-term value. This creates a spectrum of algorithms parameterized by nn.

SVG(0): Model-Free Reparameterization

When n=0n=0, the objective collapses to:

JSVG(0)(w)=EsρEϵp(ϵ)[q(s,πw(s,ϵ);θ)]J^{\text{SVG}(0)}(\boldsymbol{w}) = \mathbb{E}_{s \sim \rho}\mathbb{E}_{\epsilon \sim p(\epsilon)}\left[q(s,\pi_{\boldsymbol{w}}(s,\epsilon);\theta)\right]

No model is required. We simply differentiate the critic with respect to actions sampled from the reparameterized policy. This is the approach used in DDPG Lillicrap et al., 2015 (with a deterministic policy where ϵ\epsilon is absent) and SAC Haarnoja et al., 2018 (where ϵ\epsilon produces the stochastic component). The gradient is:

wJSVG(0)=Es,ϵ[aq(s,a;θ)a=πw(s,ϵ)wπw(s,ϵ)]\nabla_{\boldsymbol{w}} J^{\text{SVG}(0)} = \mathbb{E}_{s,\epsilon}\left[\nabla_a q(s,a;\theta)\big|_{a=\pi_{\boldsymbol{w}}(s,\epsilon)} \nabla_{\boldsymbol{w}} \pi_{\boldsymbol{w}}(s,\epsilon)\right]

This requires only that the critic qq be differentiable with respect to actions, not a learned dynamics model. All bias comes from errors in the value function approximation.

SVG(1) to SVG(nn): Model-Based Rollouts

For n1n \geq 1, we unroll a learned dynamics model for nn steps before bootstrapping with the critic. Consider SVG(1):

JSVG(1)(w)=Es,ϵ,ξ[r(s,πw(s,ϵ))+γq(f(s,πw(s,ϵ),ξ),πw(s,ϵ);θ)]J^{\text{SVG}(1)}(\boldsymbol{w}) = \mathbb{E}_{s,\epsilon,\xi}\left[r(s,\pi_{\boldsymbol{w}}(s,\epsilon)) + \gamma q(f(s,\pi_{\boldsymbol{w}}(s,\epsilon),\xi), \pi_{\boldsymbol{w}}(s',\epsilon');\theta)\right]

where s=f(s,πw(s,ϵ),ξ)s' = f(s,\pi_{\boldsymbol{w}}(s,\epsilon),\xi) is the next state predicted by the model. The gradient now flows through both the reward and the model transition. Increasing nn propagates reward information more directly through the model rollout, reducing reliance on the critic. However, model errors compound over the horizon. If the model is inaccurate, longer rollouts can degrade performance.

SVG(\infty): Pure Model-Based Optimization

As nn \to \infty, we eliminate the critic entirely:

JSVG()(w)=E{ϵi},{ξi}[i=0T1γir(si,πw(si,ϵi))]J^{\text{SVG}(\infty)}(\boldsymbol{w}) = \mathbb{E}_{\{\epsilon_i\},\{\xi_i\}}\left[\sum_{i=0}^{T-1} \gamma^i r(s_i,\pi_{\boldsymbol{w}}(s_i,\epsilon_i))\right]

This is pure model-based policy optimization, differentiating through the entire trajectory. Approaches like PILCO deisenroth2011pilco and Dreamer hafner2019dreamer operate in this regime. With an accurate model, this provides the most direct gradient signal. The tradeoff is computational: backpropagating through hundreds of time steps is expensive, and gradient magnitudes can explode or vanish over long horizons.

The choice of nn reflects a fundamental bias-variance tradeoff. Small nn relies on the critic for long-term value estimation, inheriting its approximation errors. Large nn relies on the model, accumulating its prediction errors. In practice, intermediate values like n=5n=5 or n=10n=10 often work well when combined with a reasonably accurate learned model.

Noise Inference for Off-Policy Learning

A subtle issue arises when combining reparameterization with experience replay. SVG naturally supports off-policy learning: states ss can be sampled from a replay buffer rather than the current policy. However, reparameterization requires the noise variables ϵ\epsilon that generated each action.

For on-policy data, we can simply store ϵ\epsilon alongside each transition (s,a,r,s)(s, a, r, s'). For off-policy data collected under a different policy, the noise is unknown. To apply reparameterization gradients to such data, we must infer the noise that would have produced the observed action under the current policy.

For invertible policies, this is straightforward. If a=πw(s,ϵ)a = \pi_{\boldsymbol{w}}(s, \epsilon) with ϵN(0,I)\epsilon \sim \mathcal{N}(0, I), and the policy takes the form a=μw(s)+σw(s)ϵa = \mu_{\boldsymbol{w}}(s) + \sigma_{\boldsymbol{w}}(s) \odot \epsilon (as in a Gaussian policy), we can recover the noise exactly:

ϵ=aμw(s)σw(s)\epsilon = \frac{a - \mu_{\boldsymbol{w}}(s)}{\sigma_{\boldsymbol{w}}(s)}

This recovered ϵ\epsilon can then be used for gradient computation. However, this introduces a subtle dependence: the inferred ϵ\epsilon depends on the current policy parameters w\boldsymbol{w}, not just the data. As the policy changes during training, the same action aa corresponds to different noise values.

For dynamics noise ξ\xi, the situation is more complex. If we have a probabilistic model s=f(s,a,ξ)s' = f(s, a, \xi) and observe the actual next state ss', we could in principle infer ξ\xi. In practice, environment stochasticity is often treated as irreducible: we cannot replay the exact same noise realization. SVG handles this by either: (1) using deterministic models and ignoring environment stochasticity, (2) re-simulating from the model rather than using observed next states, or (3) using importance weighting to correct for the distribution mismatch.

The noise inference perspective connects reparameterization gradients to the broader question of credit assignment in RL. By explicitly tracking which noise realizations led to which outcomes, we can more precisely attribute value to policy parameters rather than to lucky or unlucky samples.

When dynamics are deterministic or can be accurately reparameterized, SVG-style methods offer an efficient alternative to the score function methods developed in the previous section. However, many reinforcement learning problems involve unknown dynamics or dynamics that resist accurate modeling. In those settings, score function methods remain the primary tool since they require only the ability to sample trajectories under the policy.

Sampling-Based Model Predictive Control

Both SVG and the policy gradient methods above optimize a parametric policy πw\pi_{\boldsymbol{w}} that is then deployed. An alternative approach uses the model purely for planning: at each state, optimize an action sequence and execute only the first action. This is model predictive control (MPC). This avoids training a policy network and eliminates generalization error from function approximation, but requires solving an optimization problem at every decision.

For continuous action spaces with complex dynamics, gradient-based trajectory optimization (as in the trajectory optimization chapter) can be expensive. Model Predictive Path Integral control (MPPI) Williams et al., 2017 replaces gradient-based optimization with importance sampling. Given a dynamics model st+1=f(st,at)s_{t+1} = f(s_t, a_t) (deterministic) and current state s0s_0, sample KK action sequences {a(i)}i=1K\{\boldsymbol{a}^{(i)}\}_{i=1}^K and roll them out to get costs C(i)=t=0H1c(st(i),at(i))C^{(i)} = \sum_{t=0}^{H-1} c(s_t^{(i)}, a_t^{(i)}). The optimal action is computed as:

a0=i=1Kw(i)a0(i),w(i)=exp(C(i)/λ)jexp(C(j)/λ)a_0^* = \sum_{i=1}^K w^{(i)} a_0^{(i)}, \quad w^{(i)} = \frac{\exp(-C^{(i)}/\lambda)}{\sum_j \exp(-C^{(j)}/\lambda)}

where λ>0\lambda > 0 is a temperature parameter. Execute a0a_0^*, observe the next state, and repeat.

The weighting w(i)exp(C(i)/λ)w^{(i)} \propto \exp(-C^{(i)}/\lambda) is the Boltzmann distribution from entropy-regularized control. MPPI solves:

minaEξ[t=0H1c(st,at)+λH(π)]\min_{\boldsymbol{a}} \mathbb{E}_{\boldsymbol{\xi}}\left[\sum_{t=0}^{H-1} c(s_t, a_t) + \lambda H(\pi)\right]

where π\pi is the distribution over action sequences and H(π)=E[logπ(a)]H(\pi) = -\mathbb{E}[\log \pi(\boldsymbol{a})] is entropy. The importance sampling estimate approximates the optimal action under this entropy-regularized objective. The temperature λ\lambda controls the trade-off between exploitation (focus on low-cost sequences) and exploration (maintain entropy over sequences).

The Boltzmann weighting connects MPPI to entropy-regularized methods from the amortization chapter. SAC (Algorithm Algorithm 6) learns a stochastic policy πϕ\pi_{\boldsymbol{\phi}} that approximates the Boltzmann distribution over actions at each state. PCL (Algorithm Algorithm 7) minimizes path consistency residuals, exploiting the exact relationship v(s)=q(s,a)αlogπ(as)v^*(s) = q^*(s,a) - \alpha\log\pi^*(a|s) under deterministic dynamics. MPPI uses the Boltzmann distribution directly for action sequence selection but performs no learning. At each state, it samples action sequences, weights them by exponentiated costs, and returns the weighted average.

All three use entropy regularization with Boltzmann weighting. SAC and PCL amortize optimization by learning policies that generalize across states. MPPI performs full optimization at every decision, trading computation for avoiding policy approximation error.

MPPI and SVG represent two ways to use learned models. SVG backpropagates through nn-step model rollouts using the reparameterization trick, requiring differentiable dynamics and policy. MPPI samples many action sequences and aggregates using importance weights, requiring only forward simulation. SVG is more sample-efficient (gradients provide richer signal) but needs differentiable components and backpropagation through long horizons. MPPI is simpler to implement and applies to non-differentiable dynamics, but requires many rollout samples (K1001000K \approx 100-1000) per action selection.

The algorithm samples perturbed action sequences around a nominal trajectory {aˉt}\{\bar{a}_t\} (often the previous optimal sequence, shifted forward). The Boltzmann weights assign high probability to low-cost sequences. The temperature λ\lambda balances exploitation (small λ\lambda focuses on best samples) and exploration (large λ\lambda gives uniform averaging).

MPPI excels at real-time control for systems with fast, accurate models (robotics, autonomous vehicles). The replanning handles model errors and disturbances. However, the per-step computation scales as O(KH)O(KH) model evaluations, making it expensive for complex dynamics or long horizons.

Summary

This chapter developed the mathematical foundations for policy gradient methods. Starting from general derivative estimation techniques in stochastic optimization, we saw two main approaches: the likelihood ratio (score function) method and the reparameterization trick. While the reparameterization trick typically offers lower variance, it requires that the sampling distribution be reparameterizable, making it inapplicable to discrete actions or environments with complex dynamics.

For reinforcement learning, the score function estimator provides a model-free gradient that depends only on the policy parametrization, not the transition dynamics. Through variance reduction techniques (leveraging conditional independence, using control variates, and the Generalized Advantage Estimator), we can make these gradients practical for learning. When models are available, reparameterization through Stochastic Value Gradients offers lower-variance alternatives, with SVG(0) recovering actor-critic methods like DDPG and SAC, and SVG(\infty) representing pure model-based optimization. The resulting algorithms form the foundation of modern policy optimization methods.

The next chapter explores advanced policy optimization techniques including trust region methods, proximal policy optimization, and connections to optimal control theory.

References
  1. Williams, R. J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3), 229–256. 10.1007/BF00992696
  2. Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). Policy Gradient Methods for Reinforcement Learning with Function Approximation. Advances in Neural Information Processing Systems, 12, 1057–1063.
  3. Konda, V. R. (2002). Actor-Critic Algorithms [Phdthesis]. Massachusetts Institute of Technology.
  4. Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., & Tassa, Y. (2015). Learning Continuous Control Policies by Stochastic Value Gradients. Advances in Neural Information Processing Systems, 28, 2944–2952.
  5. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2015). Continuous Control with Deep Reinforcement Learning. arXiv Preprint arXiv:1509.02971.
  6. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proceedings of the 35th International Conference on Machine Learning (ICML), 1861–1870.
  7. Williams, G., Aldrich, A., & Theodorou, E. A. (2017). Model Predictive Path Integral Control: From Theory to Parallel Computation. Journal of Guidance, Control, and Dynamics, 40(2), 344–357. 10.2514/1.G001921